Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model

Internal identifier: 000783 (Main/Exploration); previous: 000782; next: 000784

Authors: Siyuan Chen [United States]; Dharitri Misra [United States]; George R. Thoma [United States]

Source:

Proceedings of SPIE, the International Society for Optical Engineering [ 0277-786X ] ; 2010.

RBID : Pascal:10-0429695

French descriptors

Pascal (Inist)
- Calcul erreur, Reconnaissance forme, Recherche documentaire, Reconnaissance optique caractère, Implémentation, Application médicale, Lexique, Articulation, Modèle de n grams, Estimation erreur, 0130C, 4230S.
Wicri:
- topic: Recherche documentaire.

English descriptors

KwdEn:
- Document retrieval, Error analysis, Error estimation, Implementation, Joint, Lexicon, Medical application, Optical character recognition, Pattern recognition, n gram model.

Abstract

In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. The module detects and corrects suspicious words in the OCR output of scanned textual documents through a procedure of deriving partial formats for each suspicious word, retrieving candidate words by partial-match search from lexicons, and comparing the joint probabilities of N-gram and OCR edit transformation corresponding to the candidates. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case comprising a historic medico-legal document collection, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.
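
The abstract mentions retrieving candidate words by partial-match search from lexicons represented by ternary search trees. The following is a minimal Python sketch of that general technique only, not the authors' implementation. The class names, the use of `.` as a single-character wildcard, and the tiny lexicon are all assumptions made for illustration; the record does not specify the actual partial format syntax.

```python
from typing import List, Optional


class Node:
    """One node of a ternary search tree: a character plus three children."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["Node"] = None   # characters smaller than ch
        self.eq: Optional["Node"] = None   # next character of the word
        self.hi: Optional["Node"] = None   # characters larger than ch
        self.is_word = False               # True if a lexicon word ends here


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def insert(self, word: str) -> None:
        if word:
            self.root = self._insert(self.root, word, 0)

    def _insert(self, node: Optional[Node], word: str, i: int) -> Node:
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_word = True
        return node

    def partial_match(self, pattern: str, wildcard: str = ".") -> List[str]:
        """Return all stored words matching the pattern; the wildcard
        (hypothetical syntax) stands for exactly one unknown character."""
        results: List[str] = []
        if pattern:
            self._match(self.root, pattern, 0, [], results, wildcard)
        return sorted(results)

    def _match(self, node, pattern, i, prefix, results, wildcard) -> None:
        if node is None:
            return
        ch = pattern[i]
        # A wildcard must explore all three branches; a literal explores one.
        if ch == wildcard or ch < node.ch:
            self._match(node.lo, pattern, i, prefix, results, wildcard)
        if ch == wildcard or ch == node.ch:
            prefix.append(node.ch)
            if i + 1 == len(pattern):
                if node.is_word:
                    results.append("".join(prefix))
            else:
                self._match(node.eq, pattern, i + 1, prefix, results, wildcard)
            prefix.pop()
        if ch == wildcard or ch > node.ch:
            self._match(node.hi, pattern, i, prefix, results, wildcard)


# Illustrative lexicon (not from the paper).
lexicon = ["medical", "medicine", "legal", "regal", "lethal"]
tree = TernarySearchTree()
for w in lexicon:
    tree.insert(w)

candidates = tree.partial_match(".egal")   # first character treated as uncertain
```

In a pipeline like the one the abstract describes, the candidates returned by such a search would then be ranked, for example by combining a language-model probability with an OCR edit probability; that ranking step is not shown here.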

Affiliations:

Links toward previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000163
to stream PascalFrancis, to step Curation: 000614
to stream PascalFrancis, to step Checkpoint: 000157
to stream Main, to step Merge: 000788
to stream Main, to step Curation: 000783

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model</title>
<author><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">10-0429695</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 10-0429695 INIST</idno>
<idno type="RBID">Pascal:10-0429695</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000163</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000614</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000157</idno>
<idno type="wicri:doubleKey">0277-786X:2010:Siyuan Chen:efficient:automatic:ocr</idno>
<idno type="wicri:Area/Main/Merge">000788</idno>
<idno type="wicri:Area/Main/Curation">000783</idno>
<idno type="wicri:Area/Main/Exploration">000783</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model</title>
<author><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1>
<s2>Bethesda, MD 20894</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName><region type="state">Maryland</region>
</placeName>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
<imprint><date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">Proceedings of SPIE, the International Society for Optical Engineering</title>
<title level="j" type="abbreviated">Proc. SPIE Int. Soc. Opt. Eng.</title>
<idno type="ISSN">0277-786X</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Document retrieval</term>
<term>Error analysis</term>
<term>Error estimation</term>
<term>Implementation</term>
<term>Joint</term>
<term>Lexicon</term>
<term>Medical application</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>n gram model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Calcul erreur</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Reconnaissance optique caractère</term>
<term>Implémentation</term>
<term>Application médicale</term>
<term>Lexique</term>
<term>Articulation</term>
<term>Modèle de n grams</term>
<term>Estimation erreur</term>
<term>0130C</term>
<term>4230S</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">In this paper we present an OCR validation module, implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. The module detects and corrects suspicious words in the OCR output of scanned textual documents through a procedure of deriving partial formats for each suspicious word, retrieving candidate words by partial-match search from lexicons, and comparing the joint probabilities of N-gram and OCR edit transformation corresponding to the candidates. The partial format derivation, based on OCR error analysis, efficiently and accurately generates candidate words from lexicons represented by ternary search trees. In our test case comprising a historic medico-legal document collection, this OCR validation module yielded the correct words with 87% accuracy and reduced the overall OCR word errors by around 60%.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
<region><li>Maryland</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Maryland"><name sortKey="Siyuan Chen" sort="Siyuan Chen" uniqKey="Siyuan Chen" last="Siyuan Chen">SIYUAN CHEN</name>
</region>
<name sortKey="Misra, Dharitri" sort="Misra, Dharitri" uniqKey="Misra D" first="Dharitri" last="Misra">Dharitri Misra</name>
<name sortKey="Thoma, George R" sort="Thoma, George R" uniqKey="Thoma G" first="George R." last="Thoma">George R. Thoma</name>
</country>
</tree>
</affiliations>
</record>
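
The record above uses the prefixes `wicri:` and `inist:` without declaring them, so a strict XML parser is likely to reject it as it stands. As a hedged sketch, one way to read such a record with Python's standard library is to wrap it in an element that declares placeholder namespaces. The URIs, the function names `parse_record` and `summarize`, and the shortened `SAMPLE` fragment below are assumptions for illustration and are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder URIs (assumptions): they only serve to bind the undeclared prefixes.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(record_xml: str) -> ET.Element:
    """Wrap the record so the wicri:/inist: prefixes are bound, then parse it."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    wrapped = f"<wrapper {decls}>{record_xml}</wrapper>"
    return ET.fromstring(wrapped).find("record")


def summarize(record: ET.Element) -> dict:
    """Pull a few bibliographic fields out of the parsed record."""
    title_stmt = record.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [n.text for n in title_stmt.iter("name")],
        "keywords": [t.text for t in
                     record.findall(".//keywords[@scheme='KwdEn']/term")],
        "issn": record.findtext(".//idno[@type='ISSN']"),
    }


# Shortened fragment with the same shape as the record above (illustration only).
SAMPLE = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model</title>
<author><name sortKey="Misra, Dharitri" first="Dharitri" last="Misra">Dharitri Misra</name>
<affiliation wicri:level="2"><inist:fA14 i1="01"><s1>U.S. National Library of Medicine</s1></inist:fA14></affiliation>
</author>
</titleStmt>
<seriesStmt><idno type="ISSN">0277-786X</idno></seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Lexicon</term><term>n gram model</term></keywords></textClass></profileDesc>
</teiHeader></TEI></record>"""

info = summarize(parse_record(SAMPLE))
```

The same two functions should apply to the full record text, provided it is passed in as a single string without an XML declaration.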

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000783 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000783 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:10-0429695
   |texte=   Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated by automated means from raw corpora, so the information has not been validated.
